#knitr::opts_chunk$set(include = FALSE) # for making prompts
#knitr::opts_chunk$set(echo = TRUE) # for making solutions
library(tidyverse)
library(modeldata)
library(leaps)
library(caret)
library(corrplot)
library(ggplot2)
library(viridis)
Jorge Bris Moreno
Assignment by: Dr. Purna Gamage
Instructions
Submission: a .qmd notebook (or .ipynb if you prefer; both versions are identical).
Optional: you can now run R inside VS Code using the .ipynb format, or work in .ipynb with Python or .rmd in RStudio. With the quarto convert command, you can re-format and jump between the different file formats. For example:
quarto convert HW-2.rmd will convert the file to HW-2.ipynb.
quarto convert HW-2.ipynb will convert the file to HW-2.qmd, which can be renamed HW-2.rmd or opened in RStudio like any other .rmd file.
quarto render HW-2.ipynb will render the notebook (either R or Python) into an aesthetically pleasing output.
Use best subset selection methods to determine the best heating values
For bioenergy production, the heating value measures the amount of heat released during combustion. The higher heating value (HHV) is a particular method for determining this heat: the higher the HHV, the more energy released for a given amount of material. You will use the biomass dataset from the modeldata package. Run ?biomass after importing the data to read about the domain. The response variable is HHV and the predictor variables are the percentages of different elements. Do not include the sample and dataset variables in your analysis.
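The import step is not shown explicitly; a minimal sketch (assuming modeldata is installed) that produces the class, dimensions, and preview printed below:

```r
# Load the biomass data and inspect its structure
library(modeldata)
data(biomass)
class(biomass)   # "data.frame"
dim(biomass)     # 536 rows, 8 columns
head(biomass, 3)
```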
HW-2.1.a Create scatter-plots of the response and predictor variables and comment on your findings.
[1] "data.frame"
[1] 536 8
sample dataset carbon hydrogen oxygen nitrogen sulfur HHV
1 Akhrot Shell Training 49.81 5.64 42.94 0.41 0.00 20.008
2 Alabama Oak Wood Waste Training 49.50 5.70 41.30 0.20 0.00 19.228
3 Alder Training 47.82 5.80 46.25 0.11 0.02 18.299
`geom_smooth()` using formula = 'y ~ x'
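The plotting code itself is not shown. One way to produce scatter plots of each predictor against HHV with fitted smooths is a long-format reshape plus faceting (a sketch; the original may have built five separate plots instead, as suggested by the repeated geom_smooth messages):

```r
library(tidyverse)
library(modeldata)
data(biomass)

# One panel per element, each with a linear smooth against HHV
biomass %>%
  select(-sample, -dataset) %>%
  pivot_longer(-HHV, names_to = "element", values_to = "percent") %>%
  ggplot(aes(x = percent, y = HHV)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap(~ element, scales = "free_x")
```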
We can see from these scatter plots that carbon has the strongest correlation with HHV, while oxygen and hydrogen also show some relationship. For nitrogen and sulfur it is hard to tell. Thus, we will also check this with a correlation plot for additional visualization.
The correlation plot confirms these observations; oxygen and hydrogen do indeed show a fairly strong correlation with HHV.
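The correlation plot referenced here could be produced as follows (a sketch using the corrplot package loaded at the top; styling options are assumptions):

```r
library(corrplot)
library(modeldata)
data(biomass)

# Correlations among the numeric columns only
num_biomass <- subset(biomass, select = -c(sample, dataset))
corrplot(cor(num_biomass), method = "circle", addCoef.col = "black")
```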
HW-2.1.b Split the dataset into an 80-20 training and test sets.
[1] "Current training labelled: 0.850746268656716"
[1] "Current testing labelled: 0.149253731343284"
We will drop the dataset column, since we are not using it and its built-in Training/Testing split does not have the 80-20 proportions we want; we will then split the data ourselves.
[1] "sample" "carbon" "hydrogen" "oxygen" "nitrogen" "sulfur" "HHV"
We will also drop the sample column, as it is categorical and unique to each observation.
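The split itself is not shown; a sketch using caret's createDataPartition (the seed value is an assumption, and the names training_set and testing_set match those used in part (e)):

```r
library(caret)
library(modeldata)
data(biomass)

set.seed(123)  # assumed seed
biomass_clean <- subset(biomass, select = -c(sample, dataset))
# Stratified 80-20 split on the response
train_idx <- createDataPartition(biomass_clean$HHV, p = 0.8, list = FALSE)
training_set <- biomass_clean[train_idx, ]
testing_set  <- biomass_clean[-train_idx, ]
```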
HW-2.1.c Use regsubsets() to perform best subset selection to pick the best model according to \(C_p\), BIC, and adjusted \(R^2\).
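The fitting step that defines summary_result is not shown; a minimal sketch, assuming the training set created in part (b):

```r
library(leaps)
# Exhaustive best subset search over all five predictors
best_subset <- regsubsets(HHV ~ ., data = training_set, method = "exhaustive")
summary_result <- summary(best_subset)
```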
par(mfrow = c(2, 2))
plot(summary_result$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(summary_result$cp), summary_result$cp[which.min(summary_result$cp)], col = "red", cex = 2, pch = 20)
plot(summary_result$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(summary_result$bic), summary_result$bic[which.min(summary_result$bic)], col = "red", cex = 2, pch = 20)
plot(summary_result$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(summary_result$adjr2), summary_result$adjr2[which.max(summary_result$adjr2)], col = "red", cex = 2, pch = 20)
A three-variable model seems to be our best pick from these graphs.
HW-2.1.d Repeat this procedure for forward stepwise selection and backward stepwise selection, compare the best models from each selection method.
# INSERT CODE
# Fit forward and backward stepwise selection on the training set
forward_model <- regsubsets(HHV ~ ., data = training_set, method = "forward")
backward_model <- regsubsets(HHV ~ ., data = training_set, method = "backward")
forward_summary <- summary(forward_model)
backward_summary <- summary(backward_model)
# Plot forward
par(mfrow = c(2, 2))
plot(forward_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(forward_summary$cp), forward_summary$cp[which.min(forward_summary$cp)], col = "red", cex = 2, pch = 20)
plot(forward_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(forward_summary$bic), forward_summary$bic[which.min(forward_summary$bic)], col = "red", cex = 2, pch = 20)
plot(forward_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(forward_summary$adjr2), forward_summary$adjr2[which.max(forward_summary$adjr2)], col = "red", cex = 2, pch = 20)
# Plot Backwards
par(mfrow = c(2, 2))
plot(backward_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(backward_summary$cp), backward_summary$cp[which.min(backward_summary$cp)], col = "red", cex = 2, pch = 20)
plot(backward_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(backward_summary$bic), backward_summary$bic[which.min(backward_summary$bic)], col = "red", cex = 2, pch = 20)
plot(backward_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(backward_summary$adjr2), backward_summary$adjr2[which.max(backward_summary$adjr2)], col = "red", cex = 2, pch = 20)
These methods also indicate that a three-variable model is our best option.
HW-2.1.e Use the predict() function to investigate the test performance in RMSE using your “best model”.
# INSERT CODE
# Select the best model for each
backward_best_index <- which.max(backward_summary$adjr2)
forward_best_index <- which.max(forward_summary$adjr2)
# Best variables
backward_best_vars <- names(coef(backward_model, id = backward_best_index))
forward_best_vars <- names(coef(forward_model, id = forward_best_index))
# Without intercept
backward_best_vars <- backward_best_vars[backward_best_vars != "(Intercept)"]
forward_best_vars <- forward_best_vars[forward_best_vars != "(Intercept)"]
# formulas so I can fit lm
backward_formula_str <- paste("HHV ~", paste(backward_best_vars, collapse=" + "))
forward_formula_str <- paste("HHV ~", paste(forward_best_vars, collapse=" + "))
backward_formula <- as.formula(backward_formula_str)
forward_formula <- as.formula(forward_formula_str)
# Fit model
backwards_best_model_fit <- lm(backward_formula, data=training_set)
forwards_best_model_fit <- lm(forward_formula, data=training_set)
# preds
predictions_back <- predict(backwards_best_model_fit, newdata=testing_set)
predictions_forw <- predict(forwards_best_model_fit, newdata=testing_set)
# RMSE
rmse_back <- sqrt(mean((testing_set$HHV - predictions_back)^2, na.rm = TRUE))
rmse_forw <- sqrt(mean((testing_set$HHV - predictions_forw)^2, na.rm = TRUE))
# RMSE
print(paste("Backward RMSE:", rmse_back))
print(paste("Forward RMSE:", rmse_forw))
[1] "Backward RMSE: 1.43045828417102"
[1] "Forward RMSE: 1.43045828417102"
Both selection methods chose the same variables, so their test RMSE is identical.
Create your own cross validation algorithm to predict the interest rate for loan applications
Lending Club gained fame for being one of the first major players in retail lending. The dataset lending_club in the modeldata package includes 9,857 loans that were provided. Interest rates of loans are often a good indicator of the level of risk associated with lending. If you are likely to pay back a loan, then you will likely be charged lower interest than someone who has a higher chance of default. Your goal is to determine the best model for predicting the interest rate charged to borrowers using best, forward, and backward subset selection within a five-fold cross-validation framework.
Prep steps: - drop all rows with missing data in the following columns
HW-2.2.a Create a correlation plot of all the numeric variables in the dataset using the corrplot package to create a high quality graph, then comment on your findings
# INSERT CODE
data("lending_club", package = "modeldata")
loan <- lending_club
numeric_vars <- select_if(loan, is.numeric)
cor_matrix <- cor(loan[, sapply(loan, is.numeric)])
corrplot(cor_matrix, method = "circle", type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45,
addCoef.col = "black",
diag = FALSE)
While there is some correlation among the other variables, interest rate is only weakly correlated with most of them; the strongest correlation is with all_util at 0.29. All the other correlations fall within ±0.25, which is not very strong. This may make building a good model more complicated, but it can still be done (up to a point).
HW-2.2.b Run best, forward, and backward subset selection on the entire dataset comment on the findings
We will keep only the numeric variables (dropping state), which the professors said is acceptable since otherwise the selection would take too long to run (even with one-hot encoding).
# INSERT CODE
# Keep only the numeric columns for subset selection
numeric_loan <- select_if(loan, is.numeric)
# Best
best_subset_fit <- regsubsets(int_rate ~ ., data=numeric_loan, nvmax=NULL, method="exhaustive", really.big = T)
summary_best <- summary(best_subset_fit)
#summary_best
par(mfrow = c(2, 2))
plot(summary_best$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(summary_best$cp), summary_best$cp[which.min(summary_best$cp)], col = "red", cex = 2, pch = 20)
plot(summary_best$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(summary_best$bic), summary_best$bic[which.min(summary_best$bic)], col = "red", cex = 2, pch = 20)
plot(summary_best$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(summary_best$adjr2), summary_best$adjr2[which.max(summary_best$adjr2)], col = "red", cex = 2, pch = 20)
# INSERT CODE
# Forward
forward_fit <- regsubsets(int_rate ~ ., data=numeric_loan, method="forward", nvmax=NULL)
summary_forward <- summary(forward_fit)
#summary_forward
par(mfrow = c(2, 2))
plot(summary_forward$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(summary_forward$cp), summary_forward$cp[which.min(summary_forward$cp)], col = "red", cex = 2, pch = 20)
plot(summary_forward$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(summary_forward$bic), summary_forward$bic[which.min(summary_forward$bic)], col = "red", cex = 2, pch = 20)
plot(summary_forward$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(summary_forward$adjr2), summary_forward$adjr2[which.max(summary_forward$adjr2)], col = "red", cex = 2, pch = 20)
# INSERT CODE
# backward
backward_fit <- regsubsets(int_rate ~ ., data=numeric_loan, method="backward", nvmax=NULL)
summary_backward <- summary(backward_fit)
#summary_backward
par(mfrow = c(2, 2))
plot(summary_backward$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(summary_backward$cp), summary_backward$cp[which.min(summary_backward$cp)], col = "red", cex = 2, pch = 20)
plot(summary_backward$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(summary_backward$bic), summary_backward$bic[which.min(summary_backward$bic)], col = "red", cex = 2, pch = 20)
plot(summary_backward$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(summary_backward$adjr2), summary_backward$adjr2[which.max(summary_backward$adjr2)], col = "red", cex = 2, pch = 20)
All three methods agree that a 10-variable model is best. Although adding more variables improves some of the criteria slightly, the gain is so small that we can settle on 10.
HW-2.2.c Create a five-fold cross-validation algorithm using for loops to compare the CV mse performance of your best two models
library(boot)
# INSERT CODE
# Selecting the model based on Cp, since its chosen size falls between those picked by BIC and adjusted R^2
best_bic_backward_index <- which.min(summary_backward$cp)
best_vars_backward <- names(coef(backward_fit, id = best_bic_backward_index))
best_bic_best_subset_index <- which.min(summary_best$cp)
best_vars_best_subset <- names(coef(best_subset_fit, id = best_bic_best_subset_index))
backward_formula <- as.formula(paste("int_rate ~", paste(best_vars_backward[-1], collapse = " + ")))
best_subset_formula <- as.formula(paste("int_rate ~", paste(best_vars_best_subset[-1], collapse = " + ")))# INSERT CODE
set.seed(4444)
folds <- cut(seq(1, nrow(numeric_loan)), breaks=5, labels=FALSE)
cv_mse_backward <- numeric(5)
cv_mse_best_subset <- numeric(5)
for(i in 1:5){
test_indices <- which(folds == i)
train_indices <- setdiff(seq_len(nrow(numeric_loan)), test_indices)
train_set <- numeric_loan[train_indices, ]
test_set <- numeric_loan[test_indices, ]
# Fit both models
fit_backward <- glm(backward_formula, data = train_set)
fit_best_subset <- glm(best_subset_formula, data = train_set)
# Preds of both
predictions_backward <- predict(fit_backward, newdata = test_set)
predictions_best_subset <- predict(fit_best_subset, newdata = test_set)
# MSE
cv_mse_backward[i] <- mean((test_set$int_rate - predictions_backward)^2, na.rm = TRUE)
cv_mse_best_subset[i] <- mean((test_set$int_rate - predictions_best_subset)^2, na.rm = TRUE)
}
# Avg MSE
avg_cv_mse_backward <- mean(cv_mse_backward)
avg_cv_mse_best_subset <- mean(cv_mse_best_subset)
print(paste("Average CV MSE for Backward Selection:", avg_cv_mse_backward))
print(paste("Average CV MSE for Exhaustive Selection:", avg_cv_mse_best_subset))
[1] "Average CV MSE for Backward Selection: 17.5556682537638"
[1] "Average CV MSE for Exhaustive Selection: 17.5556682537638"
The identical CV MSE indicates that backward and exhaustive selection settled on the same set of variables.
Using cross-validation to select best advertising budget
In this problem, we use the Advertising data download here. We want to predict Sales from TV, Radio and Newspaper, using multiple regression with all three predictors plus up to one interaction term of these three predictors, e.g. TV * Radio or Radio * Newspaper.
HW-2.4.a Should such an interaction term be included? Which one? Try to answer this question by estimating the residual standard error using 10-fold cross validation for all four possible models.
Possible models:
Sales \(\sim\) TV + Radio + Newspaper
Sales \(\sim\) TV + Radio + Newspaper + TV:Radio
Sales \(\sim\) TV + Radio + Newspaper + TV:Newspaper
Sales \(\sim\) TV + Radio + Newspaper + Radio:Newspaper
read_csv output: Rows: 200, Columns: 5; delimiter ","; all columns dbl (...1, TV, radio, newspaper, sales); the unnamed first column was renamed to ...1.
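The loading step that defines df is not shown; based on the read_csv messages above, a sketch (the filename Advertising.csv is an assumption):

```r
library(tidyverse)

# Read the advertising data and keep only the columns used below,
# dropping the unnamed row-index column
df <- read_csv("Advertising.csv") %>%
  select(TV, radio, newspaper, sales)
```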
# list possible models
models <- list(
model1 = sales ~ TV + radio + newspaper,
model2 = sales ~ TV + radio + newspaper + TV:radio,
model3 = sales ~ TV + radio + newspaper + TV:newspaper,
model4 = sales ~ TV + radio + newspaper + radio:newspaper
)
# for storing results
cv_results <- vector("list", length(models))
names(cv_results) <- c("Base", "TV:Radio", "TV:Newspaper", "Radio:Newspaper")
# loop around possible models
for (i in seq_along(models)) {
glm_model_selected <- glm(formula = models[[i]], data = df)
cv_result <- cv.glm(df, glm_model_selected, K = 10)
cv_results[[i]] <- list(cv_error = cv_result$delta[1])
}
# Results fixing diplay (look like a data frame / tibble)
cv_results_df <- bind_rows(cv_results, .id = "Model")
# Print the results
print(cv_results_df)
# A tibble: 4 × 2
Model cv_error
<chr> <dbl>
1 Base 2.93
2 TV:Radio 0.943
3 TV:Newspaper 2.89
4 Radio:Newspaper 3.06
The TV:radio interaction should be included, as the CV error decreases substantially (from 2.93 to 0.94).
HW-2.4.b Create a single plot showing the return on investment of each advertising method where the y-axis is Sales and the x-axis is advertising dollars. There should be three lines, one for each method. The slope is the coefficient from your regression. What is the best advertising method to invest in based on return on investment?
# INSERT CODE
# Models
model_tv <- lm(sales ~ TV, data = df)
model_radio <- lm(sales ~ radio, data = df)
model_newspaper <- lm(sales ~ newspaper, data = df)
coeff_tv <- coef(model_tv)
coeff_radio <- coef(model_radio)
coeff_newspaper <- coef(model_newspaper)
ad_dollars <- seq(from = 0, to = max(df$TV, df$radio, df$newspaper), by = 1)
sales_tv <- coeff_tv[1] + coeff_tv[2] * ad_dollars
sales_radio <- coeff_radio[1] + coeff_radio[2] * ad_dollars
sales_newspaper <- coeff_newspaper[1] + coeff_newspaper[2] * ad_dollars
plot_data <- tibble(
AdvertisingDollars = c(ad_dollars, ad_dollars, ad_dollars),
sales = c(sales_tv, sales_radio, sales_newspaper),
Method = factor(rep(c("TV", "radio", "newspaper"), each = length(ad_dollars)))
)
# Plot
ggplot(plot_data, aes(x = AdvertisingDollars, y = sales, color = Method)) +
geom_line() +
labs(title = "ROI on advertisement method",
x = "Advertising $",
y = "Sales",
color = "Method") +
my_theme()
Based on ROI, radio appears to be the best method, so we should recommend investing more in radio advertisements.
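The chunks below reference X and Y, which come from a simulation step that is not shown. A typical setup for this kind of exercise (the seed, sample size, cubic form, and coefficient values are all assumptions):

```r
set.seed(1)           # assumed seed
n <- 100              # assumed sample size
X <- rnorm(n)
eps <- rnorm(n)
# Assumed true model: cubic in X with placeholder coefficients
Y <- 2 + 3 * X - 1 * X^2 + 0.5 * X^3 + eps
```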
# INSERT CODE
ex_5_df <- tibble(X, Y)
model <- regsubsets(Y ~ poly(X, 10, raw = TRUE), data = ex_5_df)
summary <- summary(model)
CP <- which.min(summary$cp)
BIC <- which.min(summary$bic)
R2 <- which.max(summary$adjr2)
best_vars_CP <- names(coef(model, id = CP))
best_vars_BIC <- names(coef(model, id = BIC))
best_vars_R2 <- names(coef(model, id = R2))
cat("Best subsets:", "\n")
Best subsets:
Cp: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)5, poly(X, 10, raw = TRUE)7, poly(X, 10, raw = TRUE)9, poly(X, 10, raw = TRUE)10
BIC: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)3
Adjusted R^2: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)5, poly(X, 10, raw = TRUE)7, poly(X, 10, raw = TRUE)9, poly(X, 10, raw = TRUE)10
# INSERT CODE
par(mfrow = c(2, 2))
plot(summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(summary$cp), summary$cp[which.min(summary$cp)], col = "red", cex = 2, pch = 20)
plot(summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(summary$bic), summary$bic[which.min(summary$bic)], col = "red", cex = 2, pch = 20)
plot(summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(summary$adjr2), summary$adjr2[which.max(summary$adjr2)], col = "red", cex = 2, pch = 20)
# INSERT CODE
model <- regsubsets(Y ~ poly(X, 10, raw = TRUE), data = ex_5_df, method = "forward")
summary <- summary(model)
CP <- which.min(summary$cp)
BIC <- which.min(summary$bic)
R2 <- which.max(summary$adjr2)
best_vars_CP <- names(coef(model, id = CP))
best_vars_BIC <- names(coef(model, id = BIC))
best_vars_R2 <- names(coef(model, id = R2))
cat("Best subsets:", "\n")
Best subsets:
Cp: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)3
BIC: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)3
Adjusted R^2: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)3
# INSERT CODE
par(mfrow = c(2, 2))
plot(summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(summary$cp), summary$cp[which.min(summary$cp)], col = "red", cex = 2, pch = 20)
plot(summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(summary$bic), summary$bic[which.min(summary$bic)], col = "red", cex = 2, pch = 20)
plot(summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(summary$adjr2), summary$adjr2[which.max(summary$adjr2)], col = "red", cex = 2, pch = 20)
# INSERT CODE
model <- regsubsets(Y ~ poly(X, 10, raw = TRUE), data = ex_5_df, method = "backward")
summary <- summary(model)
CP <- which.min(summary$cp)
BIC <- which.min(summary$bic)
R2 <- which.max(summary$adjr2)
best_vars_CP <- names(coef(model, id = CP))
best_vars_BIC <- names(coef(model, id = BIC))
best_vars_R2 <- names(coef(model, id = R2))
cat("Best subsets:", "\n")
Best subsets:
Cp: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)5, poly(X, 10, raw = TRUE)7, poly(X, 10, raw = TRUE)9, poly(X, 10, raw = TRUE)10
BIC: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)5, poly(X, 10, raw = TRUE)7, poly(X, 10, raw = TRUE)9, poly(X, 10, raw = TRUE)10
Adjusted R^2: poly(X, 10, raw = TRUE)1, poly(X, 10, raw = TRUE)2, poly(X, 10, raw = TRUE)5, poly(X, 10, raw = TRUE)7, poly(X, 10, raw = TRUE)9, poly(X, 10, raw = TRUE)10
# INSERT CODE
par(mfrow = c(2, 2))
plot(summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(summary$cp), summary$cp[which.min(summary$cp)], col = "red", cex = 2, pch = 20)
plot(summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(summary$bic), summary$bic[which.min(summary$bic)], col = "red", cex = 2, pch = 20)
plot(summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(summary$adjr2), summary$adjr2[which.max(summary$adjr2)], col = "red", cex = 2, pch = 20)
While the backward method suggests a model with more variables than those chosen in part (c) and by the forward method, those two agree with each other. Thus, I would pick the three-variable model: it performs almost as well as the larger model found by backward selection, and the other two methods agree on it.